Let's start by looking at the data summary.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
From this summary we can see some broad categories like: acidity, sugar, chemical groups, quality, alcohol content.
Let's start by plotting the quality.
This looks like a normal distribution.
To continue this analysis, let's look at the density, alcohol levels and sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The density looks normally distributed, while the alcohol data is a little right-skewed (median 10.20, mean 10.42). We can see a large spike in the alcohol level around 9.5%.
Sugar is drastically skewed, so it makes sense to examine it on a log scale.
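To illustrate why a log scale helps with right-skewed data, here is a minimal sketch that uses only the first ten `residual.sugar` values shown in the data summary above. The `skewness()` helper is a hypothetical name defined here (a simple moment-based estimate), not a function from the original analysis.

```r
# First ten residual.sugar values, copied from the str() output above
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

raw_skew <- skewness(sugar)
log_skew <- skewness(log10(sugar))

# The log transform pulls in the long right tail, so the skewness shrinks
cat("raw:", round(raw_skew, 2), " log10:", round(log_skew, 2), "\n")
```

Ten rows are only an illustration; on the full dataset a histogram with a log-scaled x axis would serve the same purpose.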
Nothing significant can be seen here.
Now, let's look at the acidity.
pH seems to follow a normal distribution, with the largest concentration around 3.3.
The fixed and volatile acidity both seem to be skewed, but no pattern is visible in the citric acid levels, so let's explore it further.
It seems skewed when measured on a log scale.
Finally, let's explore the chemical levels.
These plots look like normal distributions if we remove the outliers.
Both distributions are skewed.
# Univariate Analysis
There are 1599 wines in the dataset and 13 variables: the row index “X” plus 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”).
Some interesting observations:
* The majority of the wines are rated a quality of 5 or 6.
* The alcohol levels are skewed, with a large spike at 9.5%.
* The median pH value is 3.31.
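The first observation can be checked with a short sketch. The counts per quality level are copied from the `n` column of the per-quality alcohol summary later in this report; the variable names are illustrative and not taken from the original code.

```r
# Number of wines per quality rating (n column of the per-quality summary)
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Proportion of wines at each rating
quality_share <- quality_counts / sum(quality_counts)

# Combined share of wines rated 5 or 6
share_5_or_6 <- sum(quality_share[c("5", "6")])
print(round(share_5_or_6, 3))
```

Roughly 82% of the wines fall at a rating of 5 or 6, which is why the quality histogram is so concentrated in the middle.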
The main feature in this dataset is the quality.
The main features of interest are citric.acid, residual.sugar, pH and alcohol. It would be interesting to see how these variables affect the quality.
No.
Citric acid and alcohol seem a little unusual. Alcohol has a skewed distribution with a sudden dip, so it looks almost bimodal, while citric acid is skewed on the log scale along the x axis.
No additional changes were made.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.256130895 0.6717034 0.114776724
## volatile.acidity -0.25613089 1.000000000 -0.5524957 0.001917882
## citric.acid 0.67170343 -0.552495685 1.0000000 0.143577162
## residual.sugar 0.11477672 0.001917882 0.1435772 1.000000000
## density 0.66804729 0.022026232 0.3649472 0.355283371
## pH -0.68297819 0.234937294 -0.5419041 -0.085652422
## alcohol -0.06166827 -0.202288027 0.1099032 0.042075437
## quality 0.12405165 -0.390557780 0.2263725 0.013731637
## density pH alcohol quality
## fixed.acidity 0.66804729 -0.68297819 -0.06166827 0.12405165
## volatile.acidity 0.02202623 0.23493729 -0.20228803 -0.39055778
## citric.acid 0.36494718 -0.54190414 0.10990325 0.22637251
## residual.sugar 0.35528337 -0.08565242 0.04207544 0.01373164
## density 1.00000000 -0.34169933 -0.49617977 -0.17491923
## pH -0.34169933 1.00000000 0.20563251 -0.05773139
## alcohol -0.49617977 0.20563251 1.00000000 0.47616632
## quality -0.17491923 -0.05773139 0.47616632 1.00000000
Let's draw a correlation plot to get a better understanding.
From the above table and plot matrix we see that “fixed.acidity” (0.67), “volatile.acidity” (-0.55) and “pH” (-0.54) are correlated with “citric.acid”. Interestingly, density is correlated with “fixed.acidity” (0.67) and “alcohol” (-0.50). Also, “quality” has some correlation with “alcohol” (0.48).
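A table like the one above can be produced with `cor()`. The sketch below is only an illustration: it uses the first ten rows of a few columns from the data summary (the `wine_head` data frame is a hypothetical stand-in for the full data), so its values will differ from the full-data correlations reported above.

```r
# First ten rows of selected columns, copied from the str() output above
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix (the default method of cor())
cor_matrix <- cor(wine_head)

# Find the strongest off-diagonal pair by absolute correlation
off_diag <- abs(cor_matrix)
diag(off_diag) <- 0
strongest <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]
cat("strongest pair:", rownames(cor_matrix)[strongest["row"]],
    "and", colnames(cor_matrix)[strongest["col"]], "\n")

print(round(cor_matrix, 2))
```

Zeroing the diagonal before searching avoids picking up the trivial correlation of each variable with itself.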
Let's now look at pH, fixed.acidity and volatile.acidity versus citric.acid.
##
## Call:
## lm(formula = citric.acid ~ pH, data = analysis_winedata)
##
## Coefficients:
## (Intercept) pH
## 2.5350 -0.6838
From the scatter plot we can see that the data is negatively correlated (r ≈ -0.54 in the table above).
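As a rough illustration of how to read the fitted line, the sketch below plugs the mean pH from the summary into the coefficients printed above. The function name `predict_citric` is hypothetical; with the full data one would typically call `predict()` on the fitted `lm` object instead.

```r
# Coefficients copied from the lm(citric.acid ~ pH) output above
intercept <- 2.5350
slope_pH  <- -0.6838

# Hypothetical helper: predicted citric acid for a given pH
predict_citric <- function(pH) intercept + slope_pH * pH

# A least-squares line passes through the point of means, so the prediction
# at the mean pH (3.311) should be close to the mean citric acid (0.271)
pred_at_mean <- predict_citric(3.311)
print(round(pred_at_mean, 3))

# Each 0.1 increase in pH lowers the predicted citric acid by about 0.068
print(predict_citric(3.4) - predict_citric(3.3))
```

The agreement with the mean citric acid from the summary is a quick sanity check that the printed coefficients and the summary statistics describe the same data.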
##
## Call:
## lm(formula = citric.acid ~ fixed.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) fixed.acidity
## -0.35427 0.07515
From the scatter plot we can see that the data is positively correlated (r ≈ 0.67 in the table above).
##
## Call:
## lm(formula = citric.acid ~ volatile.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) volatile.acidity
## 0.5882 -0.6011
This data looks very similar to pH vs citric acid levels. Maybe pH and volatile.acidity have some relationship. Let’s try to plot it.
##
## Call:
## lm(formula = pH ~ volatile.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) volatile.acidity
## 3.2042 0.2026
There definitely seems to be some correlation here, although it is weak (r ≈ 0.23).
Now, let's look at density vs alcohol and density vs fixed.acidity.
##
## Call:
## lm(formula = density ~ alcohol, data = analysis_winedata)
##
## Coefficients:
## (Intercept) alcohol
## 1.0059059 -0.0008788
The general trend here is that density decreases as the alcohol level increases. This makes sense, as alcohol is lighter than water and more alcohol means less water, hence lower density.
There is a clear-cut linear relationship between fixed acidity and density: the acidity goes up with the density.
Now, let's move on to the most interesting plot, between alcohol and quality.
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 9.96 0.82 9.93 10.02 0.78 8.4 11 2.6 -0.41 -0.99 0.26
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 53 10.27 0.93 10 10.21 1.19 9 13.1 4.1 0.61 -0.23
## se
## X1 0.13
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 681 9.9 0.74 9.7 9.79 0.44 8.5 14.9 6.4 1.83 5.25
## se
## X1 0.03
##
## $`6`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 638 10.63 1.05 10.5 10.56 1.19 8.4 14 5.6 0.54 -0.16
## se
## X1 0.04
##
## $`7`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 199 11.47 0.96 11.5 11.47 1.04 9.2 14 4.8 0.01 -0.47
## se
## X1 0.07
##
## $`8`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 18 12.09 1.22 12.15 12.12 1.19 9.8 14 4.2 -0.2 -0.98 0.29
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
There seems to be a positive correlation, except in the case of wines rated 5 in quality, whose mean alcohol level (9.9) is lower than that of wines rated 3 or 4.
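The dip can be located directly from the group means. This is a minimal sketch using the `n` and `mean` values copied from the per-quality summary above; the variable names are illustrative.

```r
# Per-quality counts and mean alcohol, copied from the summary above
n_wines      <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
mean_alcohol <- c("3" = 9.96, "4" = 10.27, "5" = 9.9, "6" = 10.63,
                  "7" = 11.47, "8" = 12.09)

# Change in mean alcohol from each quality level to the next
step_change <- diff(mean_alcohol)

# Quality levels whose mean alcohol is lower than the level below them
dip_levels <- names(step_change)[step_change < 0]
print(dip_levels)

# Consistency check: the count-weighted mean should match the overall mean (10.42)
overall_mean <- weighted.mean(mean_alcohol, n_wines)
print(round(overall_mean, 2))
```

Because quality 5 is also the largest group, its low mean pulls the overall alcohol mean down noticeably.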
Most of the comparisons made with citric acid showed some type of linear relationship.
The comparison between alcohol and density supported the hypothesis that wines with low alcohol levels have a high concentration of water and are therefore higher in density, as water is denser than alcohol.
Finally, quality and alcohol showed an increasing linear relationship, but there is a sudden dip in the case of wines with quality ‘5’.
As mentioned above, the dip in quality vs alcohol is very interesting.
pH and fixed acidity seem to have the strongest correlation (r ≈ -0.68).
In the above plot of alcohol vs density vs quality, we can see that wines rated 5 in quality are on the denser side while having low alcohol content.
No significant observations can be derived from this plot.
There are no interesting patterns here.
Clearly, acidity varies negatively with pH, but the quality seems to be uniform.
From the first graph it seems that density has an inverse relationship with quality: the denser the wine, the lower its score.
No.
The above three graphs show how the different acidity levels are distributed throughout the dataset.
Both fixed and volatile acidity are roughly normally distributed, with some skew, which is as expected.
There seem to be spikes in the citric acid instead of the expected normal distribution.
This is a boxplot of alcohol content grouped by wine quality level. The general expectation was to see a linear relationship between the two variables, and that seems to be the general trend.
But there seems to be a dip at quality ‘5’.
This is a Multivariate plot showing the relationship between Alcohol, Sugar and Quality.
We can see that, even though the alcohol levels vary widely with sugar, there is no clear preference for wines with lower amounts of residual sugar. The sugar levels are all over the graph.
This analysis was conducted with the aim of uncovering hidden insights by moving a step at a time and either proceeding further or backtracking based on the outcome. It was at times hard to believe when a hypothesis turned out to be incorrect, but the results did make sense. The most important thing that influenced the direction of the analysis was the patterns that emerged along the way.
In future analysis, it would make sense to carry out analysis based on the chemical compositions.
The takeaways from this analysis are that wines with high quality tend to have higher alcohol content, and that most wines in the dataset have low residual sugar, although sugar itself shows almost no correlation with quality (r ≈ 0.01). Another interesting finding was that citric acid decreases as pH increases. So, wines with lower pH (that is, higher acidity) have higher citric acid content.
In conclusion, if you are looking for a good bottle of wine, it will most likely have very little sweetness to it, but will be strong.